{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# INDIVIDUAL PROJECT: Predict the onset of diabetes based on diagnostic measures"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This case is from medical data analysis. In this field we have some properties that need to be adressed before even starting the project.\n",
    "Problems in medical data science are:\n",
    "1. Small datasets;\n",
    "2. Missing data;\n",
    "3. Corrupted data (false information: lie, flase positive/negative, bad testing etc)\n",
    "4. Only interpretable algorithms are viable (in some countries it is stated in law)\n",
    "\n",
    "So we need robust model (so it won't overfit and rely too heavy on the data) with interpretation that can handle missing data and small scale data.\n",
    "Сonsidering all of the above models that I will try to use classic algorithm (often used in med applications) - Bayes Network.\n",
    "\n",
    "About the [dataset](https://www.kaggle.com/uciml/pima-indians-diabetes-database):\n",
    "\n",
    "Diabetes Mellitus affects 422 million (dated 2014) people in the world or 8.5% of adult (over 18 years), causing 1.5–5.0 million per year (1.6 million caused by diabetes and 2.2 caused by high blood glucose in 2016).\n",
    "\n",
    "The population for this study was the Pima Indian population near Phoenix, Arizona. That population has been under continuous study since 1965 by the National Institute of Diabetes and Digestive and Kidney Diseases because of its high incidence rate of diabetes. Each community resident over 5 years of age was asked to undergo a standardized examination every two years,\n",
    "which included an oral glucose tolerance test. Diabetes was diagnosed according to World Health Organization Criteria; that is, if the 2 hour post-load plasma glucose was at least 200 mg/dl (11.1 mmol/l) at any survey examination or if the Indian Health Service Hospital serving the community found a glucose concentration of at least 200 mg/dl during the course of routine medical care. In addition to being a familiar database to the investigators, this data set provided a well validated data resource in which to explore prediction of the date of onset of diabetes in a longitudinal manner.\n",
    "\n",
    "Eight variables were chosen to form the basis for forecasting the onset of diabetes within five years in Pima Indian women (those variables were chosen because they have been found to be significant risk factors for diabetes among Pimas or other populations)\n",
    "\n",
    "1. Number of times pregnant\n",
    "2. Plasma Glucose Concentration at 2 Hours in an Oral Glucose Tolerance Test (GTIT)\n",
    "3. Diastolic Blood Pressure ($mmHg$)\n",
    "4. Triceps Skin Fold Thickness ($mm$)\n",
    "5. 2-Hour Serum Insulin ($\\mu U/ml$)\n",
    "6. Body Mass Index ($Weight(kg) / Height (m)^2$)\n",
    "7. Diabetes Pedigree Function\n",
    "8. Age (years)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Understanding the task"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. Number of times pregnant: during pregnancy woman can develop gestational diabetes: have high blood sugar levels, but those levels were normal before pregnancy. After childbirth, gestational diabetes usually goes away. But gestational diabetes makes woman more likely to develop type 2 diabetes. [[source]](https://www.webmd.com/diabetes/gestational-diabetes-guide/gestational-diabetes)\n",
    "\n",
    "\n",
    "2. Plasma Glucose Concentration at 2 Hours in an Oral Glucose Tolerance Test (GTIT): a glucose tolerance test measures how well your body’s cells are able to absorb glucose (sugar) after you consume a specific amount of sugar. Doctors use fasting blood sugar levels and hemoglobin A1c values to diagnose type 1 and type 2 diabetes as well as prediabetes. A two-hour, 75-gram oral glucose tolerance test (OGTT) is used to test for diabetes or glucose tolerance. These are the charts to evaluate results [[source]](https://www.healthline.com/health/glucose-tolerance-test)\n",
    "\n",
    "   - For prediabetes: 140–199 mg/dL\n",
    "    \n",
    "   - For diabetes: 200 mg/dL or greater\n",
    "    \n",
    "   - For gestational diabetes: >153 mg/dL\n",
    " \n",
    "\n",
    "3. Diastolic Blood Pressure: high blood pressure, or hypertension, is a condition that’s seen in people with type 2 diabetes. It’s unknown why there’s such a significant relationship between the two diseases. It’s believed that the following contribute to both conditions: obesity, a diet high in fat and sodium, chronic inflammation, inactivity. [[source]](https://www.healthline.com/health/type-2-diabetes/hypertension)\n",
    "\n",
    "\n",
    "4. Triceps Skinfold Thickness: skin fold thickness measurement provides an estimated size of the subcutaneous fat, which is the layer of subcutaneous tissue and composed of adipocytes. Subcutaneous fat is the major determinant of insulin sensitivity and has a strong association with insulin resistance. However, evidence to predict,the effect of duration of diabetes on skin fold thickness remains unclear. [[source]](https://pdfs.semanticscholar.org/5d68/b7a7391272feb9a737f4d69539483deb2556.pdf)\n",
    "\n",
    "     - Control group has around 16.7\n",
    "     - group <5 years of diabetes 30.45\n",
    "     - group 5-10 years of diabetes 31\n",
    "     - group >10 years of diabetes 38.09.\n",
    "\n",
    "\n",
    "5. 2-Hour Serum Insulin: appears to be a good indicator of insulin resistance. It can be a useful tool, especially in low resource setting where a single sample can confirm the diagnosis, thus reducing cost and repeat visits. [[source]](https://search.proquest.com/openview/f218efe6291008a23ac0b3b57e332e60/1?pq-origsite=gscholar&cbl=226481)\n",
    "\n",
    "\n",
    "6. Body Mass Index (Weight in kg / (Height in m)^2) [[source]](https://en.wikipedia.org/wiki/Body_mass_index): \n",
    "\n",
    "    - Underweight = <18.5\n",
    "    \n",
    "    - Normal weight = 18.5–24.9 \n",
    "    \n",
    "    - Overweight = 25–29.9 \n",
    "    \n",
    "    - Obesity = BMI of 30 or greater\n",
    "\n",
    "\n",
    "7. Diabetes Pedigree Function (DPF): provide a synthesis of the diabetes mellitus history in relatives and the genetic relationship of those relatives to the subject. The DPF uses information from parents, grandparents, full and half siblings, full and half aunts and uncles, and first cousins. It provides a measure of the expected genetic influence of affected and unaffected relatives on the subject's eventual diabetes risk. **This function is not validated.** [[source]](https://europepmc.org/backend/ptpmcrender.fcgi?accid=PMC2245318&blobtype=pdf)\n",
    "\n",
    "\n",
    "8. Age (years): middle-aged and older adults are still at the highest risk for developing type 2 diabetes. According to the CDC’s 2017 National Diabetes Statistics Report, there were around 1.5 million new total diabetes cases among adults in 2015. In 2015, adults aged 45 to 64 were the most diagnosed age group for diabetes. New cases of both type 1 and type 2 diabetes in people aged 18 years and older were distributed as follows [[source]](https://www.healthline.com/health/type-2-diabetes-age-of-onset):\n",
    "\n",
    "    - ages 18 to 44: 355,000 new cases\n",
    "    \n",
    "    - ages 45 to 64: 809,000 new cases\n",
    "    \n",
    "    - age 65 and older: 366,000 new cases\n",
    "    \n",
    "   So the risk group is **45-64** years\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## EDA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import seaborn as sns\n",
    "from scipy import stats\n",
    "\n",
    "from matplotlib import pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "from pomegranate import BayesianNetwork\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import roc_auc_score, classification_report, recall_score, confusion_matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv('diabetes.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Features Glucose, BloodPressure, SkinThickness, Insulin, BMI have minimum value equal to 0. That indicates an invalid or missing value. I will mark them as NaN, so my EDA will be cleaner and BN will understand that this number is missing and it needs to infer it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['Glucose'].replace(to_replace=0, value=np.NaN,  inplace=True)\n",
    "df['BloodPressure'].replace(to_replace=0, value=np.NaN,  inplace=True)\n",
    "df['SkinThickness'].replace(to_replace=0, value=np.NaN,  inplace=True)\n",
    "df['Insulin'].replace(to_replace=0, value=np.NaN,  inplace=True)\n",
    "df['BMI'].replace(to_replace=0, value=np.NaN,  inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After imputing NaN we have 392/768 objects to work with (half of the objects have missing values). It's not that bad for EDA, since I am looking for some general tendencies and I should find them (if there any) even in 50% of the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.pairplot(df.dropna())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this task I will use Bayesian Network from pomegranate. It supports only categorical values, no continuous features."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now looking closely to every feature.\n",
    "Pregnancy: feature easy to get (anamnesis). The only concern is inaccurate information. Maybe transforming this feature to simple Yes/No question won't hurt the model and will reduce the possibility of jeopardizing data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Pregnancy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.hist(df[df['Outcome'] == 1]['Pregnancies'], color='red', alpha=0.3, bins=10);\n",
    "plt.hist(df[df['Outcome'] == 0]['Pregnancies'], color='green', alpha=0.5, bins=10);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's compare distributions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "stats.ks_2samp(df[df['Outcome'] == 1]['Pregnancies'], df[df['Outcome'] == 0]['Pregnancies'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Came from the same distribution, so we will have to split them manually. We can see the differences starts from 7 or so pregnancies. I will check thresholds for \"jump\" in class distributions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "threshold = 7\n",
    "df[df['Pregnancies'] > threshold]['Outcome'].value_counts(normalize=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "6 and 7 is good thresholds."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Glucose"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Glucose is our magic feature since it's the most accurate feature to predict diabetes in medicine (by doctors)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.hist(df[df['Outcome'] == 1]['Glucose'].dropna(), color='red', alpha=0.3, bins=20);\n",
    "plt.hist(df[df['Outcome'] == 0]['Glucose'].dropna(), color='green', alpha=0.5, bins=20);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Can be easily separated, they have different сentral tendencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('No diabetes: mean = {},'.format(df[df['Outcome'] == 0]['Glucose'].mean()) + \n",
    "      ' median = {},'.format(df[df['Outcome'] == 0]['Glucose'].median()) +\n",
    "      ' mode = {}.'.format(df[df['Outcome'] == 0]['Glucose'].mode()[0]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Diabetes: mean = {},'.format(df[df['Outcome'] == 1]['Glucose'].mean()) + \n",
    "      ' median = {},'.format(df[df['Outcome'] == 1]['Glucose'].median()) +\n",
    "      ' mode = {}.'.format(df[df['Outcome'] == 1]['Glucose'].mode()[0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Domain knowledge threshold: 140 for pre-diabetes;\n",
    "\n",
    "Data driven threshold: 125"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[df['Glucose'] > 140]['Outcome'].value_counts(normalize=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[df['Glucose'] < 125]['Outcome'].value_counts(normalize=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Blood Pressure"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.hist(df[df['Outcome'] == 1]['BloodPressure'].dropna(), color='red', alpha=0.3, bins=20);\n",
    "plt.hist(df[df['Outcome'] == 0]['BloodPressure'].dropna(), color='green', alpha=0.5, bins=20);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "stats.ks_2samp(df[df['Outcome'] == 1]['BloodPressure'], df[df['Outcome'] == 0]['BloodPressure'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "They are literally the same. Let's try to separate them"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.scatter(df['BloodPressure'], df['Outcome'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[df['BloodPressure'] > 100]['Outcome'].value_counts(normalize=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Triceps Skin-Fold Thickness"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.hist(df[df['Outcome'] == 1]['SkinThickness'].dropna(), color='red', alpha=0.3, bins=20);\n",
    "plt.hist(df[df['Outcome'] == 0]['SkinThickness'].dropna(), color='green', alpha=0.5, bins=20);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Checking central tendencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('No diabetes: mean = {},'.format(df[df['Outcome'] == 0]['SkinThickness'].mean()) + \n",
    "      ' median = {},'.format(df[df['Outcome'] == 0]['SkinThickness'].median()) +\n",
    "      ' mode = {}.'.format(df[df['Outcome'] == 0]['SkinThickness'].mode()[0]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Diabetes: mean = {},'.format(df[df['Outcome'] == 1]['SkinThickness'].mean()) + \n",
    "      ' median = {},'.format(df[df['Outcome'] == 1]['SkinThickness'].median()) +\n",
    "      ' mode = {}.'.format(df[df['Outcome'] == 1]['SkinThickness'].mode()[0]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "warnings.simplefilter('ignore', UserWarning)\n",
    "ns = []\n",
    "for n in np.arange(50, 80, 1):\n",
    "    ns.append(df[df['SkinThickness'] > n][df['Outcome'] == 1]['Outcome'].size / (df[df['SkinThickness'] > n]['Outcome'].size))\n",
    "plt.plot(np.arange(50, 80, 1), ns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "warnings.simplefilter('ignore', UserWarning)\n",
    "df[df['SkinThickness'] > 55][df['Outcome'] == 1]['Outcome'].size / (df[df['SkinThickness'] > 55]['Outcome'].size)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Insulin"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.hist(df[df['Outcome'] == 1]['Insulin'].dropna(), color='red', alpha=0.3, bins=30);\n",
    "plt.hist(df[df['Outcome'] == 0]['Insulin'].dropna(), color='green', alpha=0.5, bins=30);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Diabetes: mode = {}'.format(df[df['Outcome'] == 1]['Insulin'].mean()))\n",
    "print('No diabetes: mode = {}'.format(df[df['Outcome'] == 0]['Insulin'].mean()))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[df['Insulin'] > 205]['Outcome'].value_counts(normalize=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### BMI"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.hist(df[df['Outcome'] == 1]['BMI'].dropna(), color='red', alpha=0.3, bins=30);\n",
    "plt.hist(df[df['Outcome'] == 0]['BMI'].dropna(), color='green', alpha=0.5, bins=30);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Diabetes: mean = {},'.format(df[df['Outcome'] == 1]['BMI'].mean()) + \n",
    "      ' median = {},'.format(df[df['Outcome'] == 1]['BMI'].median()) +\n",
    "      ' mode = {}.'.format(df[df['Outcome'] == 1]['BMI'].mode()[0]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('No Diabetes: mean = {},'.format(df[df['Outcome'] == 0]['BMI'].mean()) + \n",
    "      ' median = {},'.format(df[df['Outcome'] == 0]['BMI'].median()) +\n",
    "      ' mode = {}.'.format(df[df['Outcome'] == 0]['BMI'].mode()[0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Checking domain knowledge: our threshold is 30, let's use it on both ways."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[df['BMI'] < 30]['Outcome'].value_counts(normalize=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Diabetes Pedegree Function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.hist(df[df['Outcome'] == 1]['DiabetesPedigreeFunction'].dropna(), color='red', alpha=0.3, bins=30);\n",
    "plt.hist(df[df['Outcome'] == 0]['DiabetesPedigreeFunction'].dropna(), color='green', alpha=0.5, bins=30);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.scatter(df['DiabetesPedigreeFunction'], df['Outcome'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[df['DiabetesPedigreeFunction'] >= 1.1]['Outcome'].value_counts(normalize=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This feature is not validated so we need to treat it carefully, maybe it will just bring the noise to the model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Age"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.hist(df[df['Outcome'] == 1]['Age'].dropna(), color='red', alpha=0.3, bins=30);\n",
    "plt.hist(df[df['Outcome'] == 0]['Age'].dropna(), color='green', alpha=0.5, bins=30);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.scatter(df['Age'], df['Outcome'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using domain knowledge (it gives best separation between clsses judging by the histogramm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[(df['Age'] > 40) & (df['Age'] < 65)]['Outcome'].value_counts(normalize=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Feature Engineering: QUICKI"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's create a surrogate index for predicting diabetes. [QUICKI](http://diabetes.diabetesjournals.org/content/54/7/1914.full-text.pdf) is kinda simple, accurate and meaningful metric, why not use it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['QUICKI'] = 1 / (np.log(df['Insulin']) + np.log(df['Glucose']))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.hist(df[df['Outcome'] == 1]['QUICKI'].dropna(), color='red', alpha=0.3, bins=30);\n",
    "plt.hist(df[df['Outcome'] == 0]['QUICKI'].dropna(), color='green', alpha=0.5, bins=30);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[(df['QUICKI'] <= 0.1)]['Outcome'].value_counts(normalize=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Creating binary features and saving NaNs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['Pregnancy_risk'] =  (df['Pregnancies'] > 7).astype(int) #data driven, 7 gives better separation of classes\n",
    "df['Pregnancy_risk'][df['Pregnancies'].isnull()] = np.NaN\n",
    "df['Glucose_tolerance'] = (df['Glucose'] > 140).astype(int) #domain knowledge\n",
    "df['Glucose_tolerance'][df['Glucose'].isnull()] = np.NaN\n",
    "df['TSKT_risk_group'] =  (df['SkinThickness'] > 55).astype(int) #data driven\n",
    "df['TSKT_risk_group'][df['SkinThickness'].isnull()] = np.NaN\n",
    "df['Insulin_resistance'] = (df['Insulin'] > 205).astype(int) #data driven\n",
    "df['Insulin_resistance'][df['Insulin'].isnull()] = np.NaN\n",
    "df['Obesity'] =  (df['BMI'] < 27).astype(int) #data driven, model gives better results than with >30\n",
    "df['Obesity'][df['BMI'].isnull()] = np.NaN\n",
    "df['Age_risk'] = ((df['Age'] > 40) & (df['Age'] < 65)).astype(int) #domain knowledge\n",
    "df['Age_risk'][df['Age'].isnull()] = np.NaN\n",
    "df['DPF_risk'] = (df['DiabetesPedigreeFunction'] >= 1.1).astype(int) #data driven\n",
    "df['DPF_risk'][df['DiabetesPedigreeFunction'].isnull()] = np.NaN\n",
    "df['BP_risk'] = (df['BloodPressure'] > 100).astype(int) #data driven\n",
    "df['BP_risk'][df['BloodPressure'].isnull()] = np.NaN\n",
    "df['QUICKI_results'] = (df['QUICKI'] <= 0.1).astype(int) #domain knowledge\n",
    "df['QUICKI_results'][df['QUICKI'].isnull()] = np.NaN"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "work_features = ['Pregnancy_risk', 'Glucose_tolerance', 'TSKT_risk_group', 'Insulin_resistance', 'Obesity', 'Age_risk', 'DPF_risk', 'BP_risk', 'QUICKI_results']\n",
    "work = df[work_features]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(work, df['Outcome'], test_size=0.3, random_state=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Modeling"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = BayesianNetwork.from_samples(pd.concat([X_train, y_train], axis=1), algorithm='exact')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pred = np.array(model.predict(np.concatenate([X_test.values, np.full((231,1), None)], axis=1)))[:,-1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For our final model I will use classification report since it gives more than enough information about results. \n",
    "\n",
    "Recall is more important here since we need to predict diabetes not diagnose it. If we will give false positive result it won't be that bad since preventive measures won't hurt patiens (more active lifestyle, less carbohydrates, checking BP and glucose levels, blood sugar controling diet etc).\n",
    "\n",
    "For probability predictions I will use ROC AUC."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(classification_report(y_test, pred.astype(int), labels=[0,1], target_names=['No Diabetes', 'Diabetes']))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "confusion_matrix(y_test, pred.astype(int))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "probes = []"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for i in np.arange(0, len(X_test)):\n",
    "    probes.append(model.predict_proba(np.concatenate([X_test.iloc[i], [None]]))[-1].items()[1][1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "roc_auc_score(y_test, probes)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Model tuning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To chose features for final model I will use kinda dumb brute force: I will drop the feature and see the results. If results are better - feature should be dropped to improve our model. Hope it will converge to something meaningfull"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for feature in work_features:\n",
    "    print('Dropping ' + feature + ':')\n",
    "    X_train, X_test, y_train, y_test = train_test_split(work.drop(feature, axis=1), df['Outcome'], test_size=0.3, random_state=1)\n",
    "    model = BayesianNetwork.from_samples(pd.concat([X_train, y_train], axis=1), algorithm='exact')\n",
    "    pred = np.array(model.predict(np.concatenate([X_test.values, np.full((231,1), None)], axis=1)))[:,-1]\n",
    "    print('recall - {:.5f}'.format(recall_score(y_test, pred.astype(int))))\n",
    "    probes = []\n",
    "    for i in np.arange(0, len(X_test)):\n",
    "        probes.append(model.predict_proba(np.concatenate([X_test.iloc[i], [None]]))[-1].items()[1][1])\n",
    "    print('ROC AUC - {:.5f}'.format(roc_auc_score(y_test, probes)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "QUICKI was not so good (correlated with Glucose and Insulin plus has a lot more NaNs since we used 2 features with NaNs to get it), it improves our ROC but crushes recall. Ok, drop it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "work_features = ['Pregnancy_risk', 'Glucose_tolerance', 'TSKT_risk_group', 'Insulin_resistance', 'Obesity', 'Age_risk', 'DPF_risk', 'BP_risk']\n",
    "work = df[work_features]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for feature in work_features:\n",
    "    print('Dropping ' + feature + ':')\n",
    "    X_train, X_test, y_train, y_test = train_test_split(work.drop(feature, axis=1), df['Outcome'], test_size=0.3, random_state=1)\n",
    "    model = BayesianNetwork.from_samples(pd.concat([X_train, y_train], axis=1), algorithm='exact')\n",
    "    pred = np.array(model.predict(np.concatenate([X_test.values, np.full((231,1), None)], axis=1)))[:,-1]\n",
    "    print('recall - {:.5f}'.format(recall_score(y_test, pred.astype(int))))\n",
    "    probes = []\n",
    "    for i in np.arange(0, len(X_test)):\n",
    "        probes.append(model.predict_proba(np.concatenate([X_test.iloc[i], [None]]))[-1].items()[1][1])\n",
    "    print('ROC AUC - {:.5f}'.format(roc_auc_score(y_test, probes)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some features won't effect our model, we can drop them to reduce noise. For example TSKT"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "work_features = ['Pregnancy_risk', 'Glucose_tolerance', 'Insulin_resistance', 'Obesity', 'Age_risk', 'DPF_risk', 'BP_risk']\n",
    "work = df[work_features]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for feature in work_features:\n",
    "    print('Dropping ' + feature + ':')\n",
    "    X_train, X_test, y_train, y_test = train_test_split(work.drop(feature, axis=1), df['Outcome'], test_size=0.3, random_state=1)\n",
    "    model = BayesianNetwork.from_samples(pd.concat([X_train, y_train], axis=1), algorithm='exact')\n",
    "    pred = np.array(model.predict(np.concatenate([X_test.values, np.full((231,1), None)], axis=1)))[:,-1]\n",
    "    print('recall - {:.5f}'.format(recall_score(y_test, pred.astype(int))))\n",
    "    probes = []\n",
    "    for i in np.arange(0, len(X_test)):\n",
    "        probes.append(model.predict_proba(np.concatenate([X_test.iloc[i], [None]]))[-1].items()[1][1])\n",
    "    print('ROC AUC - {:.5f}'.format(roc_auc_score(y_test, probes)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our iterative proccess stops here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "work_features = ['Pregnancy_risk', 'Glucose_tolerance', 'Insulin_resistance', 'Obesity', 'Age_risk', 'DPF_risk', 'BP_risk']\n",
    "work = df[work_features]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(work, df['Outcome'], test_size=0.3, random_state=1)\n",
    "model = BayesianNetwork.from_samples(pd.concat([X_train, y_train], axis=1), algorithm='exact')\n",
    "pred = np.array(model.predict(np.concatenate([X_test.values, np.full((231,1), None)], axis=1)))[:,-1]\n",
    "probes = []\n",
    "for i in np.arange(0, len(X_test)):\n",
    "    probes.append(model.predict_proba(np.concatenate([X_test.iloc[i], [None]]))[-1].items()[1][1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "confusion_matrix(y_test, pred.astype(int))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(classification_report(y_test, pred.astype(int), labels=[0,1], target_names=['No Diabetes', 'Diabetes']))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "roc_auc_score(y_test, probes)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Playing with PR curve: maybe we can get some \"free lunch\" here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import precision_recall_curve, average_precision_score\n",
    "from sklearn.utils.fixes import signature"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "precision_recall_curve(y_test, probes, pos_label=1)\n",
    "average_precision = average_precision_score(y_test, probes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "precision, recall, threshold = precision_recall_curve(y_test, probes, pos_label=1)\n",
    "\n",
    "# In matplotlib < 1.5, plt.fill_between does not have a 'step' argument\n",
    "step_kwargs = ({'step': 'post'}\n",
    "               if 'step' in signature(plt.fill_between).parameters\n",
    "               else {})\n",
    "plt.step(recall, precision, color='b', alpha=0.2,\n",
    "         where='post')\n",
    "plt.fill_between(recall, precision, alpha=0.2, color='b', **step_kwargs)\n",
    "\n",
    "plt.xlabel('Recall')\n",
    "plt.ylabel('Precision')\n",
    "plt.ylim([0.0, 1.05])\n",
    "plt.xlim([0.0, 1.0])\n",
    "plt.title('2-class Precision-Recall curve: AP={0:0.2f}'.format(\n",
    "          average_precision))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):\n",
    "    \"\"\"\n",
    "    Modified from:\n",
    "    Hands-On Machine learning with Scikit-Learn\n",
    "    and TensorFlow; p.89\n",
    "    \"\"\"\n",
    "    plt.figure(figsize=(20, 10))\n",
    "    plt.title(\"Precision and Recall Scores as a function of the decision threshold\")\n",
    "    plt.plot(thresholds, precisions[:-1], \"b--\", label=\"Precision\")\n",
    "    plt.plot(thresholds, recalls[:-1], \"g-\", label=\"Recall\")\n",
    "    plt.ylabel(\"Score\")\n",
    "    plt.xlabel(\"Decision Threshold\")\n",
    "    plt.legend(loc='best')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_precision_recall_vs_threshold(precision, recall, thresholds=threshold)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(classification_report(y_test, np.int32(np.array(probes) >= 0.3)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "confusion_matrix(y_test, np.int32(np.array(probes) >= 0.3))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Conclusion: advantages and disadvantages, ways to imporve model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pluses:\n",
    "1. Simple to explain;\n",
    "2. Has interpreation;\n",
    "3. Fast infer;\n",
    "4. Missing data is not a problem;\n",
    "5. Supports online training.\n",
    "\n",
    "Minuses:\n",
    "1. This implementation can only work with binary features (specail software can build BN with continuous features);\n",
    "2. Dependance on data (since we chose thresholds for binarization on our features not only from domain knowledge but from data too; continuous BN would solve this problem).\n",
    "3. Still uses some features that are obtained from invasive methods (can use different features or technology to obtain data);\n",
    "4. Needs bigger and better data to get better at generalizing (BN still got it nice though).\n",
    "\n",
    "Bayesian Network was one of the best model for medical tasks (Pathfinder I, II, III; Babylon Health etc.) and still is. It can work with small sample size, missing data, prone to overfitting and has interpretability.\n",
    "But with the development of the neurobayes approaches, we will be able to see new models for medical diagnosis, health data analysis, individual medicine and decision support systems in the near future."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
